Abstract: In this paper we consider association rule mining in tables with quantitative attributes. The
chain split technique for finding frequent itemsets, known from Boolean association rule mining, is
extended to a “cube-split” technique for finding frequent itemsets in the case of specific
quantitative attributes.
1. Introduction


The goal of data mining and knowledge discovery is to extract high-level information and
relationships from the raw big data of applications. Logical-statistical association rules (simply association
rules) are one of the key data mining techniques. An association rule is a logical-statistical relation of the
form A => B, where A and B are events. The rule states that with a certain probability, called the
confidence of the rule, when A occurs in the given database so does B. The second important
characteristic of a rule is the probability of occurrence of the event A&B over the database records,
called the rule support. The problem is to find all association rules that satisfy the requirements of
minimum support and minimum confidence. The main bottlenecks of rule generation are the speed
of the algorithms used, the restriction to a limited subset of the rule collection obtained, and the applied
value of these rules, that is, their interpretation. A well-known interpretation of association rules comes
from the market-basket data analysis problem.


The structure of this article is as follows. Section 2 considers the Boolean association rule mining
problem and gives a brief description of the APRIORI algorithm, the de facto standard approach for the
case of binary attributes. Section 3 describes the structural knowledge on discrete grids that is
needed for the chain split technique, which is the main instrument considered in this work. Section 4 is
devoted to the analysis of rule mining with quantitative attributes, where the chain split technique is
extended to the case of quantitative attribute rule mining.
2. Boolean Association Rule Mining





About the problem 


Association rule mining in terms of confidence/support was first introduced in [Agrawal et al.,
1993] and was later addressed in [Agrawal et al., 1994]. These first papers consider databases consisting
of categorical attributes only (a categorical attribute is one which contains discrete and, typically,
unordered data). Thus the events A and B on both sides of the rule A => B are logical expressions of
categorical variables. The aim is to find all rules with confidence and support above the user-defined
thresholds of these parameters (minconf and minsup). It is desirable that the set of constructed rules is
sharply limited in size, since the rules may then have a better interpretation.


Earlier, for the purpose of discovering exact or almost exact rules, Piatetsky-Shapiro
[Piatetsky-Shapiro, 1991] introduced three principles for rule interestingness (0 if the variables are
statistically independent, monotonically increasing if the variables occur more often together, and
monotonically decreasing if one of the variables alone occurs more often), defining in this way the base
concepts of contemporary rule mining. In structural data analysis rules are likely to be exact. Exact rules
also appear as functional dependencies in relational databases [Armstrong, 1998]. On the other hand, in
business databases and in very large databases (similar to supermarket transaction databases), rules may
be descriptive even when they hold only approximately, with confidence much less than 100%. A number of
efficient algorithms for mining binary association rules have been developed (see [Agrawal et al., 1994],
[Mannila et al., 1994], [Toivonen, 1996] for just a few examples). The APRIORI [Agrawal et al., 1996]
algorithm is known as the de facto standard of Boolean association rule mining.


Description 


Consider a set $I = \{x_1, x_2, \ldots, x_n\}$ consisting of n items $x_i$, and its subsets (itemsets)
$X \subseteq I$. We say that X is a k-itemset when $|X| = k$. Let D be a database of records
(transactions) that are itemsets.

We say that the record $T \in D$ contributes to the itemset X if $X \subseteq T$. An association rule is an
"if-then" type logical rule X => Y, the fulfillment of which is related to certain (statistical) conditions.
Let X and Y be itemsets with $X \cap Y = \emptyset$. The ratio of the number of records of D contributing to
X to the overall number of records of D is called the support of X in D:

$\sup(X) = |\{T \in D : X \subseteq T\}| / |D|$.

Next to this is the concept of support for the rule X => Y itself:

$\sup(X \Rightarrow Y) = \sup(X \cup Y) = |\{T \in D : X \cup Y \subseteq T\}| / |D|$.

Another important property of a rule is its confidence, defined as

$\mathrm{conf}(X \Rightarrow Y) = \sup(X \cup Y) / \sup(X)$,

which is the conditional probability that a record contains Y when it is known that it contains X.
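
The definitions of support and confidence can be illustrated by a minimal Python sketch. The function names `support` and `confidence` and the toy database `D` are assumptions introduced here for illustration only; records are modelled as Python frozensets.

```python
def support(itemset, database):
    """Fraction of records T in the database with itemset contained in T."""
    itemset = frozenset(itemset)
    hits = sum(1 for record in database if itemset <= record)
    return hits / len(database)


def confidence(lhs, rhs, database):
    """conf(X => Y) = sup(X u Y) / sup(X); assumes sup(X) > 0."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(lhs | rhs, database) / support(lhs, database)


# Toy market-basket data, invented for illustration only.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support({"bread"}, D))               # 3 of 4 records contain bread
print(confidence({"bread"}, {"milk"}, D))  # 2 of those 3 also contain milk
```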


APRIORI algorithm 


Practical implementation of association rule mining techniques is a subject of intensive 
theoretical and algorithmic studies. It is well known that the problem splits naturally into two stages 
[Agrawal et al., 1996]. The first stage is constructing the so called frequent fragments (itemsets), those 
that occur in the database with frequencies above the predetermined value of support. The second 
stage is actually the phase of synthesizing the rules with a given confidence from the set of frequent 
subsets constructed during the first stage of the algorithm. 


APRIORI [Agrawal et al., 1996], the well-recognized algorithm of association rule mining, builds
the set of frequent subsets (the first stage mentioned above) by the so-called building-up method. APRIORI
first considers one-element subsets and computes their frequencies. Next, it considers all two-element
subsets whose one-element subsets are frequent, and verifies their actual occurrences in the
database. In this way the collection of frequent subsets is built up level by level, until a level is reached
at which none of the candidate subsets is frequent enough. This procedure is known as growing of frequent
itemsets. Its computational complexity is significant, which is especially important because the
algorithms are applied to very large data volumes.
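
The level-wise building-up procedure can be sketched as follows. This is a simplified illustration of the first stage only, not the optimized algorithm of [Agrawal et al., 1996]; the function name `apriori_frequent_itemsets` and the toy data are assumptions made for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(database, minsup):
    """Level-wise growth of frequent itemsets (first stage only)."""
    n_records = len(database)

    def is_frequent(candidate):
        hits = sum(1 for record in database if candidate <= record)
        return hits / n_records >= minsup

    items = sorted({x for record in database for x in record})
    level = [frozenset([x]) for x in items if is_frequent(frozenset([x]))]
    frequent = list(level)
    k = 1
    while level:
        k += 1
        known = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in known
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates if is_frequent(c)]
        frequent.extend(level)
    return frequent


# Toy database, invented for illustration only.
toy = [frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("b")]
result = apriori_frequent_itemsets(toy, minsup=0.5)
```

The prune step is where the monotonicity of support is used: a candidate with an infrequent subset is discarded without scanning the database.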


Alternative approaches 


Are there any alternative approaches for building rules? There is a huge number of approaches,
ideas and algorithms that address this issue. The case of quantitative attributes is much harder. The general
approaches applied are: rule interpretation as a union of parts, “population-subset” and
“extraordinary-behavior”, with consecutive optimization (maximization of support and/or confidence); design
and consideration of interpretable regions such as attribute and multi-attribute “convex” regions; and
application of fast algorithms, for example randomized algorithms, OPUS, Kadane's, Elias’s and other
algorithms.

In this paper we propose a new approach to rule mining, which is connected with well-known
results on the geometry of the n-dimensional unit cube that underlie Hansel’s algorithm for
monotone Boolean function identification. We give an extension of this approach to the multidimensional
multivalued grid [Aslanyan et al., 2017]. The extension covers the case of rule mining over sets of
numerical attributes.


Monotonicity property 


Monotonicity is an important property of the function sup(X):
$(X \subseteq Y) \Rightarrow \sup(X) \ge \sup(Y)$.

Monotonicity is the key property of the Boolean rule mining models, but it is not simply
interpretable in the case of numerical attributes. Things change when we try to speak in terms of
attribute negations [Boulicaut, 2001], that is, when we aim at knowing the frequencies of conjunctions of
all literals, meaning attributes and their negations. At the computational level this doubles the number of
attributes and floods the area of frequent subsets. [Boulicaut, 2001] and other publications
derived formulas that express the frequencies of negations in terms of positive frequencies.
The issue of positive-negative attributes appears in the case of quantitative attributes as
monotone-anti-monotone behavior. Quantitative-monotone, in its turn, means that an increase of attribute
values increases the probability of the target event. For a particular target event we suppose that all
individual attributes are monotonic and that, taken together, they are also monotonically related to the
target event.


Isoareas 


Isoareas of sup(X) in Figure 1 can be presented by sequences of embedded monotone
Boolean functions. But such sequences cannot behave arbitrarily. For example, consider a simple
2-row database; its rows correspond to 2 vertices of the n-cube, denoted by $v_1$ and $v_2$.
Now consider the subcubes $I(v_1, \tilde 1)$ and $I(v_2, \tilde 1)$ formed by these vertices and by the vertex
$\tilde 1 = (1, 1, \ldots, 1)$ of the cube. All points of these subcubes 1-support one of the rows, but the points
in the intersection support both of them. So it is not possible to design a database that gives a homogeneous
support value over the area covered by $I(v_1, \tilde 1)$ and $I(v_2, \tilde 1)$. Hence the mentioned “sequences of
embedded monotone Boolean functions” are specific inclusion-exclusion type objects that remain to be
described and studied in depth.

Figure 1. Hasse-type diagram of the power set of the set I of attributes (from the empty set to the set I), with
isoareas of the values of the function sup(X)


Frequent Subsets Mining (FSSM) Problem 


Given: D — the database of subsets of some set A of attributes; and a frequency threshold t > 0. 
Provide: the t-frequent subsets in D. 

For the FSSM Problem several approaches are proposed: variations to the original APRIORI algorithm 
[Agrawal et al., 1996], as well as other approaches like [Aumann et al., 2003]. The most efficient 
algorithms are based on the observation that the frequent subsets are determined by the closed 
frequent subsets, and such algorithms need to mine the closed frequent subsets at first [Li et al., 2006]. 


A subset $X \subseteq A$ is called closed (maximal subset) if for each $Y \subseteq A$, $Y \supset X$, it holds
$\sup(Y) < \sup(X)$.


For $X \subseteq A$ we define $p(X) := \bigcap \{Y \subseteq A : Y \supseteq X \text{ and } D(Y) > 0\}$ to be the closure of X in
D. It can be checked that $X \subseteq p(X)$, $X \subseteq Y \Rightarrow p(X) \subseteq p(Y)$ and $p(p(X)) = p(X)$. It also can be
checked that the closed subsets are exactly the closures.


Obviously $D: 2^A \to \mathbb{N}_0$ defines an integer function that decreases monotonically when X increases by
set inclusion over the Boolean cube $(2^A, \subseteq)$, and it can be checked that the frequent subsets
correspond to the subsets of the closed frequent subsets, and thereby the closed frequent subsets
determine the frequent subsets.
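
A small sketch of the closure operator may help. It assumes that the condition D(Y) > 0 is read as "Y occurs as a record of D", so that the closure of X is the intersection of all records containing X (the usual closure on itemsets). The function name `closure` and the toy records are hypothetical.

```python
def closure(itemset, database):
    """Intersection of all records that contain the itemset.

    Assumes the closure of X is taken over records Y of the database with
    X contained in Y. Returns None when no record contains the itemset.
    """
    itemset = frozenset(itemset)
    covering = [frozenset(r) for r in database if itemset <= r]
    if not covering:
        return None
    return frozenset.intersection(*covering)


# Toy records, invented for illustration only.
records = [frozenset("abc"), frozenset("ab"), frozenset("bd")]
closed_a = closure(frozenset("a"), records)   # every record with a also has b
```

On this toy data the three stated properties (extensivity, monotonicity, idempotence) can be checked directly by enumeration.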
3. On the Geometry of the n-Dimensional Unit Cube





n-dimensional unit cube


A variable taking only the values 0 and 1 (false and true) is called a Boolean variable. An
n-dimensional Boolean function is a single-valued mapping from the set of all vectors composed of n
Boolean variables to the Boolean set $B = \{0, 1\}$. The domain where the Boolean function is given is
known as the set of vertices of the n-dimensional unit cube $B^n$, that is, the n-th Cartesian power of the
set B. $B^n$ is the set of all binary vectors $\tilde\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, which are called vertices or points.
Usually $B^n$ is presented geometrically via the Hasse diagram, in which the vertices of $B^n$ are placed in
horizontal layers; the k-th layer ($0 \le k \le n$) contains all the vertices with exactly k ones. The
layers are arranged vertically, starting from the zero layer (at the bottom) up to the layer with number n. The
k-th layer consists of $C_n^k$ vertices. Two vertices $\tilde\alpha$ and $\tilde\beta$ are called adjacent if they differ in exactly
one coordinate. Adjacent vertices are connected by an edge.

Vertices of $B^n$ are ordered as follows: a point $\tilde\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ of $B^n$ precedes the
point $\tilde\beta = (\beta_1, \beta_2, \ldots, \beta_n)$ of $B^n$ if $\alpha_i \le \beta_i$, $1 \le i \le n$. The fact that a point $\tilde\alpha$ precedes the
point $\tilde\beta$ is denoted by $\tilde\alpha \le \tilde\beta$. If $\tilde\alpha \le \tilde\beta$ and $\tilde\alpha \ne \tilde\beta$ then we write $\tilde\alpha < \tilde\beta$. Two different points
$\tilde\alpha$ and $\tilde\beta$ are called comparable if one of the following conditions holds: $\tilde\alpha < \tilde\beta$ or $\tilde\beta < \tilde\alpha$. Otherwise
they are incomparable.
It is evident that, in general, to uniquely identify a Boolean function it is necessary to know its values at
all points of the n-dimensional unit cube. But if the function belongs to some specific class that is
narrower than the set of all Boolean functions, then to determine its values at all points of $B^n$ uniquely
it is not necessary to know in advance the values of the function at all points of $B^n$, and
sometimes it is enough to know the values on a very small subset of the vertices of $B^n$. For example, to
uniquely identify a symmetric Boolean function of n variables (these functions take the same value
on each layer of $B^n$) it is enough to know its values on a set of points of $B^n$ that intersects all
layers of $B^n$.


Monotone Boolean functions 


A Boolean function $f(x_1, x_2, \ldots, x_n)$ is called monotone if $\tilde\alpha \le \tilde\beta$ implies
$f(\alpha_1, \alpha_2, \ldots, \alpha_n) \le f(\beta_1, \beta_2, \ldots, \beta_n)$. The class of all monotone Boolean functions of n variables is
denoted by $M_n$. Some geometric properties of monotone Boolean functions are evident. To each
function there corresponds a unique set $f^0$ of incomparable vertices of $B^n$ such that $f(\tilde\alpha) = 0$ iff $\tilde\alpha$
precedes one of the vertices of $f^0$. Geometrically, the area of 0 assignments of f is a union of subcubes
spanned by the vertex $\tilde 0$ and the vertices of $f^0$. Another important property is that on growing chains of
vertices in $B^n$ the function values, 0's and 1's, fill at most two different intervals.

International Journal “Information Theories and Applications”, Vol. 24, Number 3, © 2017

Two types of recognition problems about monotone Boolean functions arise in different
applications. One is to recognize whether a given $f(\tilde x)$ belongs to $M_n$, the class of all monotone
Boolean functions; the second is to decipher $f(\tilde x)$ itself given that $f(\tilde x) \in M_n$. We address
the second problem because it is equivalent to the problem of frequent itemset mining.


Chain Split 


A separate group of algorithms for Boolean association rule mining is introduced in [Aslanyan et 
al., 2008]. 

Suppose that an arbitrary (unknown to us) function $f(\tilde x) \in M_n$ is given by an operator $A_f$,
which returns the value $f(\tilde\alpha)$ for a given input $\tilde\alpha \in B^n$. Given the operator $A_f$, it is required to fully
restore the set of values of the function $f(\tilde x)$. After each call to the operator, which returns the value
$f(\tilde\alpha)$ for the point $\tilde\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n) \in B^n$, the values at other points of $B^n$ become determined
through extension by the monotonicity property. It is clear that we should strive for optimality of these
algorithms, that is, minimize the number of calls to $A_f$.

Consider the set R of all algorithms that solve this problem. That is, for a monotone Boolean
function $f(x_1, x_2, \ldots, x_n)$ an algorithm from R, using the operator $A_f$, restores the complete table of
values of $f(\tilde x)$. The work of such an algorithm consists of several stages. The algorithm selects a point
$\tilde\alpha \in B^n$ and with the help of the operator $A_f$ computes the value $f(\alpha_1, \alpha_2, \ldots, \alpha_n)$ (selection). The
resulting value of the function at the point $\tilde\alpha$ is inserted into the table of computed values of the function.
The table is then extended by monotonicity, which determines all points whose values can no longer be
0 or 1 arbitrarily once the value at $\tilde\alpha$ is known (extension). For example, if $f(\tilde\alpha) = 1$ then for all points
$\tilde\beta$ that are higher than $\tilde\alpha$ (according to the order of vertices defined above) $f(\tilde\beta) = 1$, and the table of
values of f is filled in accordingly. Next, the algorithm selects another input for the operator $A_f$, and
the table of values of f is again extended by monotonicity. This process is repeated until the table of values
is filled completely.
A pair <algorithm $r \in R$, monotone function $f(x_1, x_2, \ldots, x_n)$> can thus be associated with
the number $\varphi(r, f)$ of calls to the operator $A_f$ during the recovery of the table of values of the function
$f(x_1, x_2, \ldots, x_n)$ by the algorithm r.

It is appropriate to evaluate the quality of the algorithms of R using the function
$\psi(R, f) = \min_{r \in R} \varphi(r, f)$, under the condition $f \in M_n$. The complexity of the recognition of the class
of n-dimensional monotone functions can be characterized by the function
$\varphi(n) = \psi(R, M_n) = \max_{f \in M_n} \psi(R, f)$, where the maximum is taken over all monotone functions.
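
The selection and extension stages described above can be sketched as a simple, non-optimal member of R. The selection rule used here (middle layers first) and the name `restore_monotone` are assumptions chosen for illustration; the sketch only shows how monotone extension reduces the number of oracle calls.

```python
from itertools import product


def restore_monotone(n, oracle):
    """Fill the value table of a monotone f on B^n; returns (table, calls)."""
    vertices = list(product((0, 1), repeat=n))
    # Hypothetical selection rule: visit the middle layers first.
    vertices.sort(key=lambda v: abs(2 * sum(v) - n))
    table, calls = {}, 0
    for v in vertices:
        if v in table:
            continue                      # already fixed by extension
        value = oracle(v)                 # selection: one call to A_f
        calls += 1
        for w in vertices:                # extension by monotonicity
            if w in table:
                continue
            if value == 1 and all(a <= b for a, b in zip(v, w)):
                table[w] = 1              # every point above a one is a one
            elif value == 0 and all(b <= a for a, b in zip(v, w)):
                table[w] = 0              # every point below a zero is a zero
    return table, calls


def threshold(v):
    """Example monotone function: 1 iff the vertex has at least two ones."""
    return 1 if sum(v) >= 2 else 0


table, calls = restore_monotone(4, threshold)
print(calls, "oracle calls instead of", 2 ** 4)
```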

Let us introduce some general terms on function deciphering [KOR,1965]. Suppose we are
given a certain class N of Boolean functions and a function f belonging to this class. A set of points
$G(f, N)$ of $B^n$ is called a resolving set for the pair $(f, N)$ if from the facts that

a) the function g belongs to N,
b) the values of f and g are the same on the set $G(f, N)$,

it follows that $f = g$.

To restore the table of values of a function it is sufficient to determine the values of the function on
one of its resolving sets. A resolving set $G(f, N)$ is called a deadlock resolving set for $(f, N)$ if no
proper subset of it is resolving for the pair $(f, N)$.

Let us denote by $H(\tilde\alpha)$ the set of points $\tilde\beta$ satisfying the condition $\tilde\alpha < \tilde\beta$, and by $L(\tilde\alpha)$ the
set of points $\tilde\gamma$ such that $\tilde\gamma < \tilde\alpha$.

An upper zero of a monotone function $f(x_1, x_2, \ldots, x_n)$ is a point $\tilde\alpha$ of $B^n$ such that
$f(\tilde\alpha) = 0$ and $f(\tilde\beta) = 1$ for all points $\tilde\beta \in H(\tilde\alpha)$.

A lower one of a monotone function $f(x_1, x_2, \ldots, x_n)$ is a point $\tilde\alpha$ such that $f(\tilde\alpha) = 1$ and
$f(\tilde\gamma) = 0$ for any point $\tilde\gamma \in L(\tilde\alpha)$.

Let $Z(f)$ denote the set of all upper zeros of a monotone function $f(x_1, x_2, \ldots, x_n)$, and
$O(f)$ the set of all lower ones. Each monotone Boolean function has a unique deadlock resolving set,
which is included in all of its resolving sets (note that this is not the case for other classes, for instance
for the class of symmetric Boolean functions mentioned above). This deadlock resolving set for a
monotone Boolean function is the set $G(f) = Z(f) \cup O(f)$.
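
For small n the sets Z(f) and O(f) can be found by direct enumeration. The following sketch assumes f is supplied as a Python function on 0/1 tuples; the names `precedes` and `upper_zeros_and_lower_ones` are hypothetical.

```python
from itertools import product


def precedes(a, b):
    """Coordinate-wise order on B^n: a precedes b."""
    return all(x <= y for x, y in zip(a, b))


def upper_zeros_and_lower_ones(n, f):
    """Brute-force Z(f) and O(f) for a monotone f on B^n."""
    vertices = list(product((0, 1), repeat=n))
    zeros = [v for v in vertices if f(v) == 0]
    ones = [v for v in vertices if f(v) == 1]
    # Upper zero: a zero with no other zero strictly above it.
    upper_zeros = [a for a in zeros
                   if not any(a != b and precedes(a, b) for b in zeros)]
    # Lower one: a one with no other one strictly below it.
    lower_ones = [a for a in ones
                  if not any(a != b and precedes(b, a) for b in ones)]
    return upper_zeros, lower_ones


def at_least_two(v):
    """Example monotone function on B^3: 1 iff at least two ones."""
    return 1 if sum(v) >= 2 else 0


Z, O = upper_zeros_and_lower_ones(3, at_least_two)
```

For this example function the upper zeros are exactly the vertices of layer 1 and the lower ones are the vertices of layer 2.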

A brief characterization of the chain split approach is as follows. The n-dimensional unit cube
$B^n$ is a binary lattice consisting of $2^n$ vertices that correspond to binary strings of length n, which are
usually arranged in layers in such a way that the k-th layer contains all those vertices that have k units
(1 values). Vertices that differ in one coordinate are called adjacent and are connected by an edge.
A chain in $B^n$ is a sequence of adjacent vertices. A chain is called growing if it contains at most one
vertex in each layer.
G. Hansel [Hansel, 1966] showed that $B^n$ can be split into growing chains under certain conditions.
Further, he considered monotone Boolean functions and built an algorithm of optimal recognition of
these functions using the constructed chains. The relationship of these constructions with association
rules is that the frequent subsets with given parameters correspond to the set of zero-value vertices of a
monotone Boolean function.
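
A split of the cube into growing chains can be generated recursively. The sketch below uses the standard doubling rule for symmetric chain decompositions, which is assumed here to stand in for the construction of [Hansel, 1966]; the specific conditions of that paper are not reproduced, and the function name `symmetric_chains` is hypothetical.

```python
from itertools import product
from math import comb


def symmetric_chains(n):
    """Split B^n (n >= 1) into growing chains by recursive doubling."""
    chains = [[(0,), (1,)]]              # B^1 is the single chain 0 < 1
    for _ in range(n - 1):
        next_chains = []
        for chain in chains:
            # Longer chain: append 0 to every vertex, then lift the top one.
            longer = [v + (0,) for v in chain] + [chain[-1] + (1,)]
            next_chains.append(longer)
            # Shorter chain: append 1 to every vertex except the top one.
            shorter = [v + (1,) for v in chain[:-1]]
            if shorter:
                next_chains.append(shorter)
        chains = next_chains
    return chains


chains4 = symmetric_chains(4)
print(len(chains4), "chains cover", sum(len(c) for c in chains4), "vertices")
```

Storing all chains explicitly, as done here, takes space exponential in n, which is the difficulty discussed next.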

Direct use of this technique of Boolean function recognition is difficult because constructing
and storing the Hansel chains has exponential algorithmic complexity, both in computation and
in the memory used.
G. Tonoyan [Tonoyan, 1976] offered a computational approach to working with chains. It
fundamentally and significantly simplifies the recognition algorithm, although the complexity is still very
high. The idea is to select one particular chain split in the collection of Hansel splits. Then a number
of functions are introduced that map chains and their elements to each other. In total, this provides the
information necessary to recognize monotone Boolean functions and eliminates the need to store the
complete structure of Hansel chains. This gives a considerable saving of memory at the cost of a small
additional computation over the chain split.

The global aim of this paper is to introduce the necessary chain split and computation technique
in terms of the problem of searching for association rules in large databases, and to extend this technique
to the mining problems of numerical attributes. Additionally, we take into account one more important
feature of the problem of mining association rules. It is known that in data mining the number n of
considered elements is very large. It is also known that frequent subsets consist of a relatively small
number of elements. This leads to the assumption that there exists a value k such that no subset with
more than k elements is frequent. It turns out that the problem of searching for frequent subsets is
equivalent to decoding a special class of monotone Boolean functions, which in turn requires an
extension of the results mentioned above for general Boolean functions to the restricted set of
functions considered. The extended results are introduced in terms of the problem of synthesizing frequent
subsets, thus providing a way of determining the set of all maximal (largest by inclusion)
frequent subsets without considering and constructing their sub-subsets. This avoids the part that
particularly complicates the building-up process.





4. Quantitative Association Rule Mining 





In practice, many, if not most, databases contain quantitative data. Unfortunately, the
definition of categorical association rules does not translate directly to the quantitative case.
This has initiated an intensive search for a definition or model of association rules for databases
with quantitative attributes. [Srikant et al., 1996] extended the categorical rule
definition to include quantitative data. They used a kind of grouping and discretization of attribute
values into intervals. Thus, each basic event becomes either a categorical item or a range of numerical
values. Although this base approach is powerful, it has a number of drawbacks, such as the difficulty of
composing the intervals correctly and the exponential blowup of the number of rules generated.

[Fukuda et al., 1996a] considered the quantitative association rule mining problem from a different
perspective, providing efficient algorithms for given values of minconf and minsup. They use
computational geometry methods and achieve efficiency even for very large databases, but the rules
considered are plain, with one categorical attribute on the right side. The left-side event of the rule has
the form $A \in [v_1, v_2]$, with an attribute A and its values $v_1$, $v_2$. The technique applied uses randomized
construction of intervals, combining them into equi-depth buckets and then joining consecutive buckets
into regions of high support. The main target is to provide the required minconf, to maximize the region
support, and to keep the algorithms at linear complexity.


The next approach to the problem of quantitative association rules is given in [Aumann et al.,
2003]. The idea is to compute and apply statistical values of the database in order to increase rule
interestingness and to combat the flooding of generated rules. This paper
summarizes several data mining concepts well. First of all, it brings the following useful description: an
association rule indicates an association between a subset of the population, described by the left-hand
side of the rule, and an extraordinary behavior of this subset, described by the right-hand side of the rule.
[Webb, 2001] later ([Aumann et al., 2003] appeared as a conference publication in 1999) labelled these
rules as impact rules to outline the difference from the approach of [Srikant et al., 1996].


Thus, the introduced general structure of an association rule is of form: 


population-subset => extraordinary behavior. (1)


In summary, an association rule considered is a rule of the form “population-subset” => “mean
of values for the subset”, where the population-subset is large enough and the mean of the subset is
significantly different from the mean of its complement in the database, so as to form an extraordinary
event. We also expect efficient algorithms that are able to generate the required rules of this type. The
general structure (1) gives rise to many different concrete rule types, determined by the subset class
used on the left-hand side and the description of extraordinary behavior used on the right-hand side.
Continuing in this way, [Aumann et al., 2003] constructed two types of rules: Categorical to Quantitative
rules with an unlimited number of attributes on each side, and Quantitative to Quantitative rules where
both sides contain a single attribute only. It is evident that these are still narrow rule sets. Information on
other work on quantitative rule mining may be found in [Aumann et al., 2003] and [Hahsler, 2017]. In
particular, [Fukuda et al., 1996b] and [Yoda et al., 1997] consider rules from 2 numerical attributes to one
Boolean attribute, where the numerical attributes form connected and x-monotone, rectangular, or
rectilinear areas (in terms of discrete tomography, x-monotone means v-convex and rectilinear means
hv-convex). The whole diversity of studies, as we see, considers a large number of very restricted types
of rules and models, which rarely provide effective computation of a limited number of properly
optimized rules. Another concern is that these quantitative approaches lose the main essence of
Boolean rule mining, namely frequent itemset growing.
Let us also mention QuantMiner [Salleb-Aouissi et al., 2007], a Quantitative Association
Rules Mining tool available online. It takes a set of numerical attributes into consideration in the mining
process without prior binning/discretization of the data. It exploits recent research in
genetic algorithms.


Concluding, 


(*) In this context we aim at constructing (multi & numerical) => Binary rules under the single
supposition of attribute monotonicity. Monotonicity is a natural phenomenon; it is not absolute, but it
can be supposed for any kind of extraordinary behavior. In exceptional cases the attribute value
domain can be split into several intervals, each of which supports monotonicity. All we suppose is that
these cases of splitting for monotonicity are only exceptions and that the individual attribute
monotonicity integrates into a collective monotonicity of the model. As a consequence, we
obtain a model based on the frequent itemset growing technique. We extend Hansel’s
chain split technique and monotone Boolean function recognition as the algorithmic basis of
our new quantitative association rule mining model.





Multi-valued cube splitting to unit-cubes 


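Let $\tilde m = (m_1, m_2, \ldots, m_n)$ be an integer vector of n dimensions, and let $\Xi^n_{\tilde m}$ be the set of
vertices of the n-coordinate discrete grid defined as the Cartesian product of the sets
$\Xi_{m_i} = \{0, 1, \ldots, m_i - 1\}$:

$\Xi^n_{\tilde m} = \Xi_{m_1} \times \Xi_{m_2} \times \cdots \times \Xi_{m_n} = \{(a_1, a_2, \ldots, a_n) : a_i \in \Xi_{m_i},\ 1 \le i \le n\}$.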

In this section we introduce a special decomposition of $\Xi^n_{\tilde m}$ into structures isomorphic to binary
cubes.

The binary cubes may have different dimensions, but their distribution by cube size is canonical, and
together the cubes cover $\Xi^n_{\tilde m}$ disjointly and entirely. Before the description we distinguish several types
of special vertices in $\Xi^n_{\tilde m}$.


Middle vertices 


Vertices $\tilde m_{\mathrm{mid}+} = (\lceil m_1/2 \rceil, \lceil m_2/2 \rceil, \ldots, \lceil m_n/2 \rceil)$ and
$\tilde m_{\mathrm{mid}-} = (\lfloor m_1/2 \rfloor, \lfloor m_2/2 \rfloor, \ldots, \lfloor m_n/2 \rfloor)$ we call
middle vertices of $\Xi^n_{\tilde m}$. These two vectors coincide when all $m_i$ are even. Being skewed because
of the possible differences of the values $m_1, m_2, \ldots, m_n$, $\Xi^n_{\tilde m}$ has exactly the mentioned one or two
special points at the center. Even with one odd $m_i$ the central points are different. Let $n_o$ denote the
number of all odd $m_i$ values. The set of all points with coordinates $m_{i,\mathrm{mid}+} = \lceil m_i/2 \rceil$ and/or
$m_{i,\mathrm{mid}-} = \lfloor m_i/2 \rfloor$ is located between $\tilde m_{\mathrm{mid}+}$ and $\tilde m_{\mathrm{mid}-}$. The number of such vertices is equal
to $2^{n_o}$. These points fill a structure isomorphic to $B^{n_o}$. Symmetrically, let us denote $n - n_o$ by $n_e$.


Upper vertices, lower vertices 


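A vertex $(a_1, a_2, \ldots, a_n)$ of $\Xi^n_{\tilde m}$ is called an upper vertex if $(a_1, a_2, \ldots, a_n) \ge \tilde m_{\mathrm{mid}+}$. Similarly,
a vertex $(a_1, a_2, \ldots, a_n)$ of $\Xi^n_{\tilde m}$ is called a lower vertex if $(a_1, a_2, \ldots, a_n) \le \tilde m_{\mathrm{mid}-}$. $\hat\Xi$ and $\check\Xi$
denote the sets of all upper and lower vertices of $\Xi^n_{\tilde m}$, correspondingly. It is easy to check that

$|\hat\Xi| = |\check\Xi| = \prod_{i=1}^{n} (\lfloor m_i/2 \rfloor + 1)$.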














Vertical equivalence 


Vertices $\tilde a = (a_1, a_2, \ldots, a_n)$ and $\tilde b = (b_1, b_2, \ldots, b_n)$ of $\Xi^n_{\tilde m}$ are called vertically equivalent
(V-equivalent) if $a_i \in \{b_i, m_i - b_i\}$ for $1 \le i \le n$. It is easy to check that this condition applies
symmetrically to $\tilde a$ and $\tilde b$, creating a structure of equivalence classes over $\Xi^n_{\tilde m}$. Let $V(\tilde a)$ denote the
class of V-equivalence of the vertex $\tilde a$. In $V(\tilde a)$ we distinguish two vertices $\hat a$ and $\check a$ with coordinates
defined as follows:

$\hat a_i = \begin{cases} a_i, & \text{if } a_i \ge m_{i,\mathrm{mid}+} \\ m_i - a_i, & \text{if } a_i \le m_{i,\mathrm{mid}-} \end{cases}$

$\check a_i = \begin{cases} m_i - a_i, & \text{if } a_i \ge m_{i,\mathrm{mid}+} \\ a_i, & \text{if } a_i \le m_{i,\mathrm{mid}-} \end{cases}$

Vertices $\hat a$ and $\check a$ are the only “two” vertices of an arbitrary $V(\tilde a)$ that belong to $\hat\Xi$ and $\check\Xi$,
respectively. Thus all vertices of the sets $V(\tilde a)$ can be obtained from the upper and/or lower element of the
class of V-equivalence by component subset inversions (with respect to the values $m_i$). It is evident that the
equivalence classes of different vertices of $\hat\Xi$ (or $\check\Xi$) are disjoint. This construction provides a
partitioning of $\Xi^n_{\tilde m}$ into equivalence classes uniquely defined by the elements of $\hat\Xi$.
For a given $\tilde a \in \Xi^n_{\tilde m}$ define the integer $k_{\tilde a} = |\{a_i : a_i \ne m_i/2\}|$. Then $|V(\tilde a)| = 2^{k_{\tilde a}}$. We
identify each vertex $\tilde\beta \in V(\tilde a)$ with an n-dimensional binary sequence $\tilde\gamma$ such that $\gamma_i = 1$ if and only if
$\beta_i = \hat a_i$. In this manner, $V(\tilde a)$ becomes isomorphic to the $k_{\tilde a}$-dimensional binary cube $B^{k_{\tilde a}}$: the 0-th
level contains the lower vertex of $V(\tilde a)$ belonging to $\check\Xi$; the i-th level consists of all vertices of $V(\tilde a)$
which can be obtained from the lower vertex by applying i component inversions.
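
The V-equivalence classes can be enumerated directly for a small grid. The sketch below assumes that coordinate i ranges over 0, 1, ..., m_i, the convention under which the inversion a_i -> m_i - a_i stays inside the grid; the function names `v_class` and `upper_and_lower` are hypothetical.

```python
from itertools import product


def v_class(a, m):
    """All vertices b with b_i in {a_i, m_i - a_i} for every coordinate i."""
    options = [sorted({ai, mi - ai}) for ai, mi in zip(a, m)]
    return set(product(*options))


def upper_and_lower(a, m):
    """Upper and lower representatives of the class of a."""
    upper = tuple(max(ai, mi - ai) for ai, mi in zip(a, m))
    lower = tuple(min(ai, mi - ai) for ai, mi in zip(a, m))
    return upper, lower


m = (4, 3)                       # toy grid, assumed ranges 0..4 and 0..3
cls = v_class((1, 2), m)         # two coordinates differ from m_i / 2
up, low = upper_and_lower((1, 2), m)

# Partition of the whole grid into classes.
grid = list(product(*(range(mi + 1) for mi in m)))
classes = {frozenset(v_class(a, m)) for a in grid}
print(len(classes), "classes cover", sum(len(c) for c in classes), "vertices")
```

Taking the coordinate-wise maximum of a_i and m_i - a_i gives the same vertex as the case definition of the upper representative, since a_i >= m_i - a_i exactly when a_i is at or above the middle.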


Thus $\Xi^n_{\tilde m}$ is partitioned into $|\hat\Xi|$ disjoint equivalence classes that are identical in structure to
binary cubes. It is worth mentioning that in the usual chain split (such as the partition of the binary cube in
[Hansel, 1966]) the chains are composed of neighboring vertices, whereas in the case of the cube-split
the edges of the chains connect, in general, vertices that do not belong to neighboring layers of $\Xi^n_{\tilde m}$.


Let us obtain the general description of the collection of all V-equivalence classes of $\Xi^n_{\tilde m}$. If all
$m_i$ are odd, the sizes of all $|\hat\Xi|$ subcubes of the partition are equal to $2^n$.


When all $m_i$ are even, this is the case of the unique middle point, and an arbitrary vertex
$\tilde a \in \Xi^n_{\tilde m}$ may have any given number $k_{\tilde a} \le n$ of coordinates that differ from $m_i/2$. The volume of
the subcube corresponding to such a point $\tilde a \in \Xi^n_{\tilde m}$ is $2^{k_{\tilde a}}$. The number of all such points with $k_{\tilde a}$
“un-concentrated” coordinates may be calculated in the following way. Construct all different
n-vectors that have $n - k_{\tilde a}$ coordinates equal to $m_i/2$ while the remaining coordinates accept all feasible
assignments. For a coordinate i with value $m_i$ the number of such independent assignments equals $m_i/2$,
taking into account that only the upper vertices, those from $\hat\Xi$, are to be considered. For one fixed
collection of $n - k_{\tilde a}$ coordinates that take the values $m_i/2$ we obtain the product of the terms $m_i/2$ over
the $k_{\tilde a}$ coordinates outside this collection. The total number of upper vertices with $k_{\tilde a}$ un-concentrated
coordinates is equal to the sum of these products, each of $k_{\tilde a}$ terms $m_i/2$, over all selections of $k_{\tilde a}$
coordinates. Denote this number by $\varphi(\tilde m, k_{\tilde a})$,


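$$\varphi(m, k_\alpha) = \sum_{I \subseteq \{1,\dots,n\},\, |I| = k_\alpha} \varphi(m, I) = \sum_{I \subseteq \{1,\dots,n\},\, |I| = k_\alpha} \; \prod_{i \in I} m_i/2.$$ 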


It is easy to check that the combinatorial generating function of these numbers is 


$$g(m) = \prod_{i=1}^{n} \left(1 + (m_i/2)\,x\right). \qquad (2)$$ 

To obtain the value of $\varphi(m, k_\alpha)$, one expands the expression of $g(m)$ analytically and takes 
the coefficient at $x^{k_\alpha}$. 


Let us also give the formula 


$$\sum_{k_\alpha} \varphi(m, k_\alpha)\, 2^{k_\alpha} = \prod_{i=1}^{n} (m_i + 1), \qquad (3)$$ 


which serves as a check of the structural description of the cube split through $\varphi(m, k_\alpha)$. Recall that $\varphi(m, k_\alpha)$ is the 
coefficient at $x^{k_\alpha}$ of $g(m)$ in formula (2). Taking x = 2 in (2) multiplies each coefficient $\varphi(m, k_\alpha)$ by 
$2^{k_\alpha}$, which gives the individual summands of the left side of (3). Substituting x = 2 directly into the product in (2) 
gives the right side of (3), since $1 + (m_i/2) \cdot 2 = m_i + 1$, which proves the check. 
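
The counting argument can be checked numerically. The sketch below expands the product in formula (2) to obtain the coefficients, compares them with the direct sum over index subsets, and evaluates both sides of formula (3); the function names `phi_coefficients` and `phi_direct` are hypothetical, and all m_i are assumed even.

```python
from itertools import combinations
from math import prod

def phi_coefficients(m):
    """Coefficients of g(x) = prod_i (1 + (m_i/2) x); entry k is phi(m, k)."""
    coeffs = [1]
    for mi in m:
        half = mi // 2                 # m_i is assumed even
        nxt = coeffs + [0]
        for k, c in enumerate(coeffs):
            nxt[k + 1] += half * c     # multiply the polynomial by (1 + half * x)
        coeffs = nxt
    return coeffs

def phi_direct(m, k):
    """Sum over all k-element index sets I of the products of m_i / 2."""
    return sum(prod(m[i] // 2 for i in idx)
               for idx in combinations(range(len(m)), k))

m = (2, 4, 6)
coeffs = phi_coefficients(m)
lhs = sum(c * 2 ** k for k, c in enumerate(coeffs))   # left side of (3)
rhs = prod(mi + 1 for mi in m)                        # right side of (3)
```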


Thus we considered two special cases of m, one with all odd and the other with all even 
coordinates. The general case, which may have odd as well as even coordinates, may be analyzed by a 
simple combination of these two subcases. 


Cube-Split Association Rule Mining (CSARM) Algorithm 


At this point we suppose that we are given a relational table T with n attributes: $A_1, A_2, \dots, A_n$. 
The rows form a stream of transactions in which Boolean and quantitative attributes are 
used together. For an arbitrary categorical attribute, when necessary, an additional set of Boolean 
attributes may be generated as indicators of the different value domains of that attribute, but we 
suppose that the table T is already in a mixed form of Boolean and quantitative attributes and will not be 
changed structurally. As mentioned above, we consider rules of type A => B where the left side 
attributes are monotonically related to the target event B. Our next supposition (secondary but 
important) is that all transactions are sparse, which means that the attributes with 
positive (non-empty) values are a very limited share of the whole set of attributes. In the traditional problem 
of supermarket basket analysis, the total number of items in the supermarket is of course very large, but 
each individual purchase basket contains only a very limited number of items. Under these suppositions our 
association rule mining procedure reduces to the recognition of a monotone binary function. 

To complete the description of the CSARM model, it remains to refer the reader to the works 
[Aslanyan et al., 2017] and [Aslanyan et al., 2008], which describe Boolean monotone recognition in 
detail. The basic application of the Cube-and-Chain Split Data Mining framework was in 
the new versions of the intrusion detection system SPARTA [Aslanyan et al., 2011]. SPARTA mines 
sets of LOG files in a computer system to detect nonstandard and extraordinary behavior such 
as intrusion into the system. We only need to give the general algorithmic framework of the CSARM 
system. 
Let $f: \Xi^n_{m+1} \to \{0,1\}$ be a monotone function defined with the help of an oracle $O_f$. 


The CSARM algorithm first splits $\Xi^n_{m+1}$ into $|\hat{\Xi}^n_{m+1}|$ vertical equivalence classes. At the second stage 
CSARM maps the monotone binary function f onto the constructed binary cubes. This procedure 
produces a large set of monotone Boolean functions that will be recognized with the help of Hansel's 
algorithm. 


The final step of CSARM integrates the fragments of recognition into a general structure of the 
frequent subsets of the attributes. 


CSARM Algorithm. 


> Partition $\Xi^n_{m+1}$ into the set of vertical equivalence classes $V_1, V_2, \dots, V_{|\hat{\Xi}^n_{m+1}|}$, and compose the 
corresponding binary subcubes $E(V_1), E(V_2), \dots, E(V_{|\hat{\Xi}^n_{m+1}|})$ as described in the previous 
sections (cube split). 

> In every cube $E(V_i)$ consider the binary function $f_i: E(V_i) \to \{0,1\}$, defined as follows: 
$f_i(\beta) = 1$ if and only if $f(b) = 1$, for every $\beta \in E(V_i)$, where b is the origin of $\beta$ in $\Xi^n_{m+1}$. $f_i$ 
is monotone and is given with the help of $O_f$. Apply Hansel's chain split method for the 
recognition of $f_i$. 

> Integrate the results of the $|\hat{\Xi}^n_{m+1}|$ binary recognition procedures to obtain the target function f. 
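
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the CSARM implementation: the grid is taken to be {0..m_1} x ... x {0..m_n}, classes are built by component inversions a_i -> m_i - a_i, and Hansel's chain-based recognition is replaced by a much simpler monotone propagation (a vertex is queried only if its value does not follow from values already known in the same subcube). The name `csarm_sketch` and the oracle signature are hypothetical.

```python
from itertools import product

def csarm_sketch(m, oracle):
    """Recover a monotone 0/1 function on the grid, subcube by subcube.
    Returns the table of values and the number of oracle queries used."""
    n = len(m)
    grid = product(*(range(mi + 1) for mi in m))
    # Step 1: one class per upper representative (cube split).
    uppers = sorted({tuple(max(a, mi - a) for a, mi in zip(v, m)) for v in grid})
    values, queries = {}, 0
    for alpha in uppers:
        # Coordinates that can be inverted, i.e. that are not at the middle.
        free = [i for i in range(n) if 2 * alpha[i] != m[i]]
        known = {}  # binary vector gamma -> value of f_i on this subcube
        # Step 2: recognize f_i on the binary cube of dimension len(free).
        for gamma in product((0, 1), repeat=len(free)):
            b = list(alpha)
            for bit, i in zip(gamma, free):
                if bit == 0:
                    b[i] = m[i] - alpha[i]   # inverted (lower) coordinate
            b = tuple(b)                     # origin of gamma in the grid
            if any(val == 1 and all(x <= y for x, y in zip(g, gamma))
                   for g, val in known.items()):
                val_b = 1                    # lies above a known one
            elif any(val == 0 and all(x >= y for x, y in zip(g, gamma))
                     for g, val in known.items()):
                val_b = 0                    # lies below a known zero
            else:
                val_b = oracle(b)
                queries += 1
            known[gamma] = val_b
            # Step 3: integrate the fragment into the table of f.
            values[b] = val_b
    return values, queries

f = lambda v: int(sum(v) >= 5)               # an example monotone oracle
values, queries = csarm_sketch((2, 3, 4), f)
```

The propagation rule is valid only because each f_i is monotone on its subcube; it does not achieve the query bound of Hansel's method.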


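Let $\varphi_c(n)$ be the minimal number of queries which is sufficient for recognizing an arbitrary 
monotone function of n variables defined on $\Xi^n_{m+1}$ by the CSARM Algorithm. Then: 


$$\varphi_c(n) = \sum_{k=0}^{n} \varphi(m, k) \cdot \left( \binom{k}{\lfloor k/2 \rfloor} + \binom{k}{\lfloor k/2 \rfloor + 1} \right)$$ for all $m_i$ even, and 


$$\varphi_c(n) = \prod_{i=1}^{n} \frac{m_i + 1}{2} \cdot \left( \binom{n}{\lfloor n/2 \rfloor} + \binom{n}{\lfloor n/2 \rfloor + 1} \right)$$ for all $m_i$ odd. 


Thus we obtained formulas for $\varphi_c(n)$ in two special cases of m, one with all odd and the other 
with all even coordinates. The general case, which may have odd as well as even coordinates, may be 
analyzed by a simple combination of these two subcases. 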
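
For concrete grids the two query counts can be evaluated directly. The sketch below assumes the reading of the formulas in which each k-dimensional subcube costs Hansel's count of C(k, floor(k/2)) + C(k, floor(k/2) + 1) queries; the function names are hypothetical.

```python
from math import comb, prod

def hansel_bound(k):
    """Queries needed by Hansel's method on the k-dimensional binary cube."""
    return comb(k, k // 2) + comb(k, k // 2 + 1)

def phi(m):
    """Coefficients phi(m, k) of prod_i (1 + (m_i/2) x), all m_i assumed even."""
    coeffs = [1]
    for mi in m:
        coeffs = [a + (mi // 2) * b
                  for a, b in zip(coeffs + [0], [0] + coeffs)]
    return coeffs

def queries_all_even(m):
    # phi(m, k) subcubes of dimension k, each recognized with hansel_bound(k) queries.
    return sum(c * hansel_bound(k) for k, c in enumerate(phi(m)))

def queries_all_odd(m):
    # prod (m_i + 1) / 2 subcubes, all of full dimension n.
    return prod((mi + 1) // 2 for mi in m) * hansel_bound(len(m))

even_count = queries_all_even((2, 4))   # grid with 3 * 5 = 15 vertices
odd_count = queries_all_odd((3, 5))     # grid with 4 * 6 = 24 vertices
```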